This project analyses part of a baseball statistics dataset, which includes pitching, hitting, and fielding tables. We will be analysing the batting data provided within this dataset.
Here is a short explanation of the meaning of each column:
2.2 Batting Table
playerID Player ID code
yearID Year
stint player's stint (order of appearances within a season)
teamID Team
lgID League
G Games
AB At Bats
R Runs
H Hits
2B Doubles
3B Triples
HR Homeruns
RBI Runs Batted In
SB Stolen Bases
CS Caught Stealing
BB Base on Balls
SO Strikeouts
IBB Intentional walks
HBP Hit by pitch
SH Sacrifice hits
SF Sacrifice flies
GIDP Grounded into double plays
Loading the pandas, NumPy, and matplotlib libraries. I will also be using seaborn for creating visualisations in this project.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load the CSV file into a pandas DataFrame using the read_csv function provided by the pandas library. We will exclusively be looking at the batting table for the purposes of this project.
bt = pd.read_csv("Batting.csv")
Check that the data loaded properly by looking at a sample of it using the DataFrame.head function from pandas.
bt.head()
print(bt.columns)
The describe function gives summary statistics for all the features of the data. We can see the minimum and maximum values of all the numerical columns here. Note that yearID is also included in this description, though year should ideally be a categorical variable; we will fix that later in the project. If we force the describe function to print all the included features, it reports information about the categorical variables as well. As we can see in the example below, a description of playerID is also provided: it has 18659 unique values, which is fewer than the total count for that column, so we can safely assume that there are multiple entries per player. This is confirmed by the freq statistic for that column. The description also shows that many columns in this dataframe have NA values; we will handle this further on in the project.
bt.describe(include = 'all')
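The NA counts and repeated-player observation described above can also be checked directly. A minimal sketch on a toy frame (standing in for the real Batting.csv, which is not reproduced here):

```python
# Sketch on toy data: count missing values per column and distinct
# players, mirroring the checks read off describe() above.
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'playerID': ['aardsda01', 'aardsda01', 'aaronha01'],  # hypothetical IDs
    'HR': [0.0, np.nan, 44.0],
    'RBI': [np.nan, np.nan, 130.0],
})

na_counts = toy.isna().sum()           # missing entries per column
n_players = toy['playerID'].nunique()  # distinct players vs. total rows
print(na_counts)
print(n_players, 'unique players in', len(toy), 'rows')
```

Since n_players is smaller than the row count, multiple rows per player are confirmed without reading the freq statistic.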
We can get the correlations between the numerical features with the corr function from pandas, which returns the pairwise correlations of all numerical columns in the dataframe. But reading this as a table of numbers is difficult; it would be better to map it onto a visual plot, which we will do shortly. By default this function uses Pearson's r to calculate the correlation.
# We will only take batting into account.
# numeric_only=True is required in newer pandas versions, which no longer
# silently drop non-numeric columns.
corr = bt.corr(method='pearson', numeric_only=True)
print(corr)
We can use the seaborn heatmap function to better visualise the correlations between features. Some correlations are very strong, such as between at bats and number of hits. If we were to extend this work to predictive analysis, we could use methods like PCA to reduce the redundancy between features. Most of the features here appear to be positively correlated.
ax = sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
As we observed before, there are many NA entries in the data columns. There are a number of ways to handle this. One is to remove every row with at least one NA value, but then we can lose crucial information from some columns altogether. Another is to remove NA values only from the columns currently under observation during data exploration, and decide afterwards whether to delete them permanently.
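The two strategies can be sketched on a toy frame (column names are illustrative, not the real table):

```python
# Sketch on toy data: drop rows with any NA, vs. drop NAs only in the
# column currently under observation (via dropna's subset parameter).
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'HR': [1.0, np.nan, 3.0, 4.0],
    'SF': [np.nan, np.nan, np.nan, 2.0],  # sparse, like sacrifice flies
})

drop_all = toy.dropna()                    # loses every row with any NA
drop_hr_only = toy.dropna(subset=['HR'])   # keeps rows where only SF is NA
print(len(drop_all), len(drop_hr_only))
```

The blanket drop keeps far fewer rows than the targeted one, which is exactly the trade-off described above.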
Below we will delete all rows with NA entries from the dataset and take a look at the correlation heatmap again.
df = bt.dropna()
df.describe(include = 'all')
ncorr = df.corr(numeric_only=True)  # numeric_only is required in newer pandas
sns.heatmap(ncorr, xticklabels=ncorr.columns.values, yticklabels=ncorr.columns.values)
plt.tight_layout()
plt.show()
We can see above that the correlation structure of the data is maintained even after removing all rows with NA values. But the size of the dataframe is reduced considerably, which means we have lost a big chunk of data.
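How much data a blanket dropna() discards is easy to quantify. A sketch on a toy frame standing in for bt and df:

```python
# Sketch on toy data: measure the fraction of rows lost to dropna().
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': [np.nan, 2.0, 3.0, 4.0],
})
kept = toy.dropna()
lost_fraction = 1 - len(kept) / len(toy)
print(f"{lost_fraction:.0%} of rows lost")
```

The same two-line comparison of len(bt) and len(df) would show the loss on the real data.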
df.dtypes
We can see that the data types of most of the features are float, as expected, but teamID and lgID are given as object. We can convert them to the category type, which turns them into categorical variables. The same goes for yearID: it is stored as an integer, but it is not very useful as one, so we will convert it into a categorical variable as well.
df['playerID'] = df['playerID'].astype('category')
df['teamID'] = df['teamID'].astype('category')
df['lgID'] = df['lgID'].astype('category')
df['yearID'] = df['yearID'].astype('category')
print(df.dtypes)
df.describe(include = 'all')
ncorr = df.corr(numeric_only=True)  # numeric_only is required in newer pandas
sns.set()
sns.heatmap(ncorr,xticklabels=ncorr.columns.values,yticklabels=ncorr.columns.values)
plt.tight_layout()
plt.show()
We can use pairplot to get a visual idea of the relationships between all pairs of features. The plot below shows that the features in this dataset are highly correlated, which confirms the heatmap we saw earlier. We can also see that the data is skewed in many places. We could use the preprocessing functionality from the sklearn library to normalise the data, but here we will use the apply function to apply normalisation over all the columns of the dataframe.
sns.pairplot(df)
We can try to normalise the data, which will make it easier to use when we run our analytics later. Since this normalisation requires numerical data, we will have to drop the categorical columns first. As we can see, the features end up on a comparable scale.
df_normalized = df.drop(['playerID','yearID','teamID','lgID'], axis =1)
df_normalized = df_normalized.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))  # centre on the mean, scale by the range
df_normalized['playerID'] = df['playerID']
df_normalized['yearID'] = df['yearID']
df_normalized['teamID'] = df['teamID']
df_normalized['lgID'] = df['lgID']
sns.pairplot(df_normalized)
df_normalized.head()
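The mean/range scaling applied above can be checked on a single toy column: the result is centred on zero and its spread (max minus min) becomes exactly 1.

```python
# Check the (x - mean) / (max - min) scaling on a small toy series.
import pandas as pd
import numpy as np

x = pd.Series([0.0, 5.0, 10.0])
scaled = (x - np.mean(x)) / (np.max(x) - np.min(x))
print(scaled.tolist())  # [-0.5, 0.0, 0.5]
```

Note this is not min-max scaling to [0, 1] and not z-score standardisation; it is a mean-centred rescaling by the range.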
Having seen the chart of all the correlations and the pairplot distributions, we can now examine relationships between pairs of variables in more detail.
We can use jointplot from seaborn to understand the relationship between the home runs made by a player and the runs for that entry. The jointplot function shows the distribution of each variable and fits a linear model to the pair. As we can see from the plot below, a plain linear model might not explain the spread of the data.
sns.set(style="darkgrid", color_codes=True)
sns.jointplot(x='HR', y='R', data=bt, kind="reg", color="g", height=7)  # 'height' was called 'size', and 'dropna' was removed, in newer seaborn
In the plot below we can see a stripplot of the entries per league over the years. What is interesting here is that a lot of data for some leagues would have been lost if we had used the filtered dataframe with no NA values, as the second plot below shows. Hence, we have to be careful while filtering data; we might end up losing crucial information about many features.
sns.stripplot(x = 'lgID', y= 'yearID', data = bt, jitter = True)
df['yearID'] = df['yearID'].astype('int64')
sns.stripplot(x = 'lgID', y= 'yearID', data = df, jitter = True)
df['yearID'] = df['yearID'].astype('category')
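Which league codes vanish under filtering can be checked with a set difference rather than by eye. A sketch on toy frames standing in for bt and df (the 'NA' league code and the sparse SF column are illustrative assumptions):

```python
# Sketch on toy data: find league codes present in the raw frame but
# absent after dropna(), i.e. leagues wiped out by blanket filtering.
import pandas as pd
import numpy as np

bt_toy = pd.DataFrame({
    'lgID': ['NL', 'NA', 'AL'],   # hypothetical league codes
    'SF': [1.0, np.nan, 2.0],     # early leagues lack modern stat columns
})
df_toy = bt_toy.dropna()
lost_leagues = set(bt_toy['lgID']) - set(df_toy['lgID'])
print(lost_leagues)
```

On the real data, the same comparison between bt['lgID'] and df['lgID'] would list every league removed by the NA filter.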
We can see the number of hits over the years for the different leagues. Here we have used the unfiltered data again, since we would have lost league information using the filtered data. It may also be that some of the leagues are older and were dissolved, since there is no data on them for the last several decades.
sns.lmplot(x = 'yearID', y = 'H', data = bt, col = 'lgID', hue = 'lgID')
We can check whether there is a relationship between the number of home runs and the total number of runs; the plot shows a fairly linear relationship between the two variables. We are plotting these values using the normalised data. Since the normalisation is a linear rescaling of each column, the shape of the distribution remains almost the same.
plt.scatter(df_normalized['R'], df_normalized['HR'], alpha = 0.2)
grouped = bt.groupby(['playerID']).mean(numeric_only=True)  # numeric_only is required in newer pandas
grouped.describe(include = 'all')
From the plot below we can see that the mean number of home runs per player has steadily increased over the years. This can be attributed to better playing strategies, better equipment, or any number of other factors supporting players in modern times.
plt.scatter(grouped['yearID'],grouped['HR'])
In the plot below we can see the trend for different features over each decade from 1871 to 2015. We are using the originally loaded data here so as not to miss rows with individual NA values. The mean value in the last decade appears to drop because the final bin does not cover a full decade. We can add more features to this plot to understand the relationships between variables if needed.
bt['teamID'] = bt['teamID'].astype('category')
bt['lgID'] = bt['lgID'].astype('category')
nbt = bt.sort_values('yearID')  # keep yearID numeric so it can be binned
bins = np.arange(1871, 2015, 10)
# Bin each year into a decade; compute the bin index from the sorted
# frame so the group labels line up with its rows.
ind = np.digitize(nbt['yearID'], bins)
nbt = nbt.groupby(ind).mean(numeric_only=True)
plt.title('Mean data over decades', color='black')
plt.plot(nbt['H'], label = 'Number of Hits ', color = 'b')
plt.plot(nbt['HR'], label = 'Homeruns ', color = 'g')
plt.plot(nbt['R'], label = 'Runs', color = 'r')
plt.plot(nbt['2B'], label = 'Doubles', color = 'c')
plt.plot(nbt['3B'], label = 'Triples', color = 'm')
plt.plot(nbt['RBI'], label = 'Runs batted in ', color = 'y')
plt.plot(nbt['SB'], label = 'Stolen bases', color = 'k')
plt.plot(nbt['CS'], label = 'Caught Stealing', color = '#bcdfde')
plt.plot(nbt['BB'], label = 'Base on Balls', color = '#acdeff')
plt.plot(nbt['SO'], label = 'Strikeouts', color = '#cdfeee')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()
We can see below that the number of strikeouts was in fact very low until about 1955, after which there is a sudden increase. I got curious about this phenomenon and, not knowing much about the game, searched online for information about changes to the game in the mid-50s. It looks like there was a major change between the 50s and 60s, as described here: https://en.wikipedia.org/wiki/Major_League_Baseball_relocation_of_1950s%E2%80%9360s. This phenomenon is not limited to strikeouts; we can see it in other features as well, as the second plot shows.
plt.scatter(y = df['SO'], x = df['yearID'], alpha = 0.2)
plt.scatter(y = df['H'], x = df['yearID'], alpha = 0.2)
This is a very big dataset with multiple files. If we move on to predicting the outcome of a game using this data, we can do so by combining multiple files; we might even be able to predict the performance of a player for a season. One thing to note, however: given the last plots we saw, it might not be useful to include data from before 1955 when predicting the outcome of modern games. But that is speculation, which can only get a conclusive answer after we run some analytics on the dataset.
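Restricting to the modern era would be a one-line filter. A sketch on a toy frame (the 1955 cutoff is the speculative threshold from the plots above, not an established rule):

```python
# Sketch on toy data: keep only post-1954 rows before any modelling.
import pandas as pd

toy = pd.DataFrame({
    'yearID': [1890, 1954, 1955, 2010],
    'H': [10, 20, 30, 40],
})
modern = toy[toy['yearID'] >= 1955]
print(len(modern), 'modern-era rows')
```

Whether the cutoff actually improves predictions would have to be tested, e.g. by comparing model error with and without the early rows.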